Hierarchical task mapping

ABSTRACT

Mapping tasks to physical processors in a parallel computing system may include partitioning tasks in the parallel computing system into groups of tasks, the tasks being grouped according to their communication characteristics (e.g., pattern and frequency); mapping, by a processor, the groups of tasks to groups of physical processors, respectively; and fine tuning, by the processor, the mapping within each of the groups.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. patent application Ser. No. 13/413,286, filed Mar. 6, 2012, which claims the benefit of U.S. Provisional Application No. 61/470,232, filed on Mar. 31, 2011. The contents of those applications are incorporated by reference herein in their entirety.

FIELD

The present application relates generally to computers, computer applications, and parallel computing, and more particularly to task mapping in parallel computing systems.

BACKGROUND

As high performance computing systems scale up, mapping the tasks of a parallel application onto physical processors to allow efficient communication becomes one of the challenging problems. Many mapping techniques have been developed to improve application communication performance. First, graph embedding has been studied and applied to optimize very large scale integrated (VLSI) circuits. See, e.g., John A. Ellis. Embedding rectangular grids into square grids. IEEE Trans. Comput., 40(1):46-52, 1991; Rami G. Melhem and Ghil-Young Hwang. Embedding rectangular grids into square grids with dilation two. IEEE Trans. Comput., 39(12):1446-1455, 1990. Graph embedding for VLSI circuits tries to minimize the longest path.

Second, space filling curves (See, e.g., Space-Filling Curves. Springer-Verlag, 1994) are applied to map parallel programs onto parallel computing systems. The use of space filling curves to improve proximity for mapping is well studied and has proven useful in parallel computing. The paper Masood Ahmed and Shahid Bokhari. Mapping with space filling surfaces. IEEE Trans. Parallel Distrib. Syst., 18:1258-1269, September 2007, extends the concept of space filling curves to space filling surfaces. It describes three different classes of space filling surfaces and calculates the distance between facets.

There are methods using graph partitioning and search-based optimization to solve the mapping problem. For example, G. Bhanot, A. Gara, P. Heidelberger, E. Lawless, J. C. Sexton, and R. Walkup. Optimizing task layout on the Blue Gene/L supercomputer. IBM Journal of Research and Development, 49(2):489-500, March 2005, uses off-line simulated annealing to explore different mappings on Blue Gene/L™.

The work in Hao Yu, I-Hsin Chung, and Jose Moreira. Topology mapping for Blue Gene/L supercomputer. In SC '06: Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, page 116, New York, N.Y., USA, 2006. ACM, developed topology mapping libraries. The mapping techniques are based on folding heuristics. Methods based on folding heuristics require that the topologies of both guest and host are already known.

Recently, new mapping techniques have been developed. See, e.g., Abhinav Bhatelé, Eric Bohm, and Laxmikant V. Kalé. A case study of communication optimizations on 3D mesh interconnects. In Euro-Par '09: Proceedings of the 15th International Euro-Par Conference on Parallel Processing, pages 1015-1028, Berlin, Heidelberg, 2009. Springer-Verlag.

In terms of supporting message passing interface (MPI) topology functions, there is work done for specific systems: Jesper Larsson Träff. Implementing the MPI process topology mechanism. In Supercomputing, pages 1-14, 2002, uses a graph-partitioning-based approach for embedding, and Sangman Moh, Chansu Yu, Dongsoo Han, Hee Yong Youn, and Ben Lee. Mapping strategies for switch-based cluster systems of irregular topology. In 8th IEEE International Conference on Parallel and Distributed Systems, Kyongju City, Korea, June 2001, describes embedding techniques for switch-based networks.

BRIEF SUMMARY

A method for mapping tasks to physical processors in a parallel computing system in a hierarchical manner may be provided. In one aspect, a method for mapping tasks to physical processors in parallel computing may include partitioning tasks in the parallel computing system into groups of tasks, the tasks being grouped according to their communication pattern and frequency. The method may also include mapping the groups of tasks to groups of physical processors, respectively. The method may further include fine tuning the mapping of tasks to processors within each of the groups.

A system for mapping tasks to physical processors in a parallel computing system, in one aspect, may include a module operable to execute on a processor, and further operable to partition tasks in the parallel computing system into groups of tasks, the tasks being grouped according to their communication pattern and frequency. The module may be further operable to map the groups of tasks to groups of physical processors, respectively. The module may be further operable to fine tune the mapping of tasks to processors within each of the groups.

A computer readable storage medium storing a program of instructions executable by a machine to perform one or more methods described herein also may be provided.

Further features as well as the structure and operation of various embodiments are described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers indicate identical or functionally similar elements.

BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS

FIG. 1 is a flow diagram illustrating a method of the present disclosure in one embodiment.

FIG. 2 illustrates an example of Moore's space filling curve constructed recursively.

FIGS. 3A and 3B show two examples of mapping 64 tasks into a 4×4×4 cube.

FIG. 4 illustrates a schematic of an example computer or processing system that may implement the task mapping system in one embodiment of the present disclosure.

DETAILED DESCRIPTION

In the present disclosure, in one embodiment, a hierarchical mapping algorithm is disclosed. The hierarchical mapping algorithm in one embodiment includes partitioning tasks into groups of related tasks, assigning the groups of tasks to groups of physical processors, and refining or fine tuning the assignments within each group. A group of tasks is referred to as a supernode.

In one embodiment of the present disclosure, the tasks that are grouped together are related by the frequency of communication among one another. For instance, the hierarchical mapping algorithm may partition the tasks by utilizing a run-time communication matrix to preserve the locality of communication. The algorithm then may extend Moore's space filling curve over the task partitions for global mapping. Each partition may be further fine tuned using a local search method to improve the communication performance.

The method of the present disclosure in one embodiment tries to preserve locality and reduce communication time. An example of the method of the present disclosure in one embodiment further extends the efforts of the known methods so that the mapping can be handled efficiently on large scale systems while run-time communication performance data is taken into consideration. In one embodiment, the method of the present disclosure may use heuristics with a better initial mapping and explore different mappings in parallel in different supernodes. In addition to folding heuristics, the method of the present disclosure in one embodiment may integrate run-time measurement into the mapping consideration. Methods based on folding heuristics require previous knowledge of the topologies of guest and host. The method of the present disclosure in one embodiment may be based on run-time measurements, which allows mapping to be done more dynamically.

The hierarchical mapping algorithm of the present disclosure in one embodiment may first group nearby tasks that are related, e.g., that frequently communicate with each other based on an MPI trace collected during run-time, into a “supernode”. Similar measurement (e.g., bandwidth and latency) and grouping are done for the physical processors. Then, in the global mapping, we apply mapping methods such as Moore's space filling curve to map the supernodes onto the processor groups on the host machine. After the supernodes are mapped onto the host machine, we swap the tasks within a supernode to explore better mapping configurations (which can be done in parallel). Moore's space filling curve is described in Eliakim Hastings Moore. On certain crinkly curves. Transactions of the American Mathematical Society, 1(1):72-90, 1900.

In one embodiment, the local search method described in Jon Kleinberg and Eva Tardos. Algorithm Design. Addison-Wesley Longman Publishing Co., Inc., Boston, Mass., USA, 2005, may be used as an optimization technique for swapping tasks within a supernode. The technique is effective in attacking NP-hard problems. In one embodiment of the present disclosure, to make the method more efficient, tasks in a supernode are classified into two types: boundary tasks and interior tasks. The boundary tasks are selected to maintain continuity, and the interior tasks are swapped with a greedy method to explore possible improvements.

When the number of tasks increases, finding the optimal mapping for those two kinds of problems becomes a challenge, since tuning cannot be done by hand. Considering scalability and automation, the present disclosure proposes the hierarchical mapping algorithm. In one embodiment, the algorithm may include three parts: the task partition, the global mapping, and the local tuning. In the task partition, the algorithm evenly groups tasks that have strong relations into “supernodes”. In the global mapping, those “supernodes” are mapped onto processor groups of the host machine. The mapping is then fine tuned locally by optimization methods. Hierarchical mapping in the present disclosure refers to grouping tasks into supernodes, mapping the groups, and then fine tuning within each group.

FIG. 1 is a flow diagram illustrating a method of the present disclosure in one embodiment. At 102, tasks are partitioned into groups of tasks, each group forming a supernode. At 104, global mapping is performed to assign the partitioned tasks, or supernodes, to physical processors of a host machine or target machine. At 106, the mapping of tasks to processors within a supernode is fine tuned. Each of the steps is explained in further detail below.
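As a toy, end-to-end illustration of this flow, the sketch below runs all three steps on a one-dimensional host. The fixed-size partition, the identity global mapping, and the absolute-difference distance model are simplifications standing in for the techniques detailed in the following sections, not the disclosure's actual methods.

```python
import numpy as np

def toy_hierarchical_map(T, group_size):
    """Toy walk-through of FIG. 1 on a 1-D host: partition (102),
    global mapping (104), local tuning (106)."""
    n = T.shape[0]
    # 102: fixed-size supernodes (a stand-in for the communication-aware
    # partition of the "Task Partition" section).
    groups = [range(i, i + group_size) for i in range(0, n, group_size)]
    # 104: identity global mapping; phi[task] = processor index.
    phi = list(range(n))

    def cost(m):
        # 1-D hop distance |m[i] - m[j]| as a toy processor-distance model.
        return sum(T[i, j] * abs(m[i] - m[j])
                   for i in range(n) for j in range(n))

    # 106: inside each supernode, accept any pairwise swap that lowers cost.
    for g in groups:
        for a in g:
            for b in g:
                if a >= b:
                    continue
                trial = phi[:]
                trial[a], trial[b] = trial[b], trial[a]
                if cost(trial) < cost(phi):
                    phi = trial
    return phi

rng = np.random.default_rng(0)
T = rng.integers(0, 10, (8, 8))
T = T + T.T                      # symmetric toy traffic matrix
print(toy_hierarchical_map(T, 4))
```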

Task Partition

The task partition may be done via analysis of the communication pattern, represented by a matrix, e.g., matrix A, where a_(ij) represents the size of data transferred between MPI ranks i and j. In MPI programming, rank refers to a unique identifier (ID) given to a task. The matrix may be collected during run-time using an MPI tracing tool such as the one described in H. Wen, S. Sbaraglia, S. Seelam, I. Chung, G. Cong, and D. Klepacki. A productivity centered tools framework for application performance tuning. In QEST '07: Proceedings of the Fourth International Conference on the Quantitative Evaluation of Systems (QEST 2007), pages 273-274, Washington, D.C., USA, 2007. IEEE Computer Society. The task partition problem is transformed into the problem of finding the blocks of a sparse matrix, for example, as described in Richard Vuduc and Hyun-Jin Moon. Fast sparse matrix-vector multiplication by exploiting variable blocks. In Proceedings of the International Conference on High-Performance Computing and Communications, 2005. The task partition exploits the structure of the nonzero elements. The matrix is partitioned into four submatrices, two in each dimension. If the ratio between nonzero and zero elements in a submatrix exceeds some threshold, the partition stops. Otherwise, the partition continues recursively, down to some preset block size. This procedure is done automatically, as in the sketch that follows.
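A minimal sketch of that recursion, assuming the communication matrix is a dense NumPy array and using a nonzero-fraction density test in place of the nonzero-to-zero ratio; the threshold and minimum block size are illustrative parameters, not values from the disclosure.

```python
import numpy as np

def find_blocks(A, r0, r1, c0, c1, density_thresh=0.75, min_size=2, out=None):
    """Quadrisect A[r0:r1, c0:c1] (two cuts per dimension) until a
    submatrix is dense enough to keep whole, or the preset minimum
    block size is reached."""
    if out is None:
        out = []
    sub = A[r0:r1, c0:c1]
    density = np.count_nonzero(sub) / sub.size
    if density >= density_thresh or (r1 - r0) <= min_size or (c1 - c0) <= min_size:
        out.append((r0, r1, c0, c1))      # keep this block as one unit
        return out
    rm, cm = (r0 + r1) // 2, (c0 + c1) // 2
    for rs, re, cs, ce in ((r0, rm, c0, cm), (r0, rm, cm, c1),
                           (rm, r1, c0, cm), (rm, r1, cm, c1)):
        find_blocks(A, rs, re, cs, ce, density_thresh, min_size, out)
    return out

# Four dense diagonal blocks, as when tasks communicate within cliques:
A = np.kron(np.eye(4), np.ones((4, 4)))
blocks = find_blocks(A, 0, 16, 0, 16)
print([b for b in blocks if np.count_nonzero(A[b[0]:b[1], b[2]:b[3]])])
```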

There are cases when a user with domain knowledge may know the problem properties. Then the task partition can be decided directly by the user. For instance, if there are block structures coming naturally from the problem, then the tasks should be partitioned according to that structure. Another partition example that comes naturally is when the MPI program uses different MPI communicators for different tasks. This approach gives the user more freedom to choose proper blocks, since a block can be formed by elements across the entire matrix, not just by adjacent elements.

Global Mapping

The global mapping works on supernodes in one embodiment of the present disclosure. The dimension of the supernode is used as a unit to measure the dimension of the host machine. For instance, suppose the number of tasks in a supernode is 16, and the topology of the host machine is an 8×8×8 cube. If the dimension of a supernode is decided to be 2×2×4, then the problem becomes mapping 32 supernodes onto a 4×4×2 cube.
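A small helper mirroring that arithmetic, with the host and supernode shapes of the example above:

```python
def reduced_host(host_dims, supernode_dims):
    """Measure the host in supernode units: each dimension of the host
    must be an integer multiple of the supernode's."""
    assert all(h % s == 0 for h, s in zip(host_dims, supernode_dims))
    return tuple(h // s for h, s in zip(host_dims, supernode_dims))

dims = reduced_host((8, 8, 8), (2, 2, 4))
print(dims)                                  # (4, 4, 2)
print(dims[0] * dims[1] * dims[2])           # 32 supernode slots
```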

For the mapping of a ring or a chain of supernodes to the reduced hostmachine, the Moore's space filling curve may be used. The space fillingcurves can be constructed recursively, which means it has hierarchicalstructure, as shown in FIG. 2. Also, many applications use periodicboundary condition, which makes the communication pattern as a ring or atorus. Moore's space filling curve can map a ring to a square or to acube, which is more versatile than other kinds of space filling curves.We extend its idea to allow the host space to be rectangular.
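For concreteness, here is one common construction of the Moore curve via L-system rewriting; the axiom and rules below are the standard textbook ones, and the rectangular-host extension mentioned above is not reproduced here.

```python
def moore_curve(order):
    """Cells of a 2**order x 2**order grid in Moore-curve order,
    generated from the curve's standard L-system:
        axiom: LFL+F+LFL
        L -> -RF+LFL+FR-      R -> +LF-RFR-FL+
    F moves one cell; + / - turn 90 degrees; L and R only rewrite.
    The walk ends next to where it starts, so the visiting order is
    a closed ring -- which is why the curve suits rings and tori."""
    s = "LFL+F+LFL"
    rules = {"L": "-RF+LFL+FR-", "R": "+LF-RFR-FL+"}
    for _ in range(order - 1):
        s = "".join(rules.get(ch, ch) for ch in s)
    x, y, dx, dy = 0, 0, 0, 1
    pts = [(0, 0)]
    for ch in s:
        if ch == "F":
            x, y = x + dx, y + dy
            pts.append((x, y))
        elif ch == "+":
            dx, dy = -dy, dx          # turn left
        elif ch == "-":
            dx, dy = dy, -dx          # turn right
    mx, my = min(p[0] for p in pts), min(p[1] for p in pts)
    return [(px - mx, py - my) for px, py in pts]   # shift into the grid

ring = moore_curve(3)                # order 3: an 8 x 8 grid
print(len(ring), len(set(ring)))     # 64 cells, all distinct
```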

High Dimensional Mapping

In this section, we demonstrate how the high dimensional mapping problem can be solved using the hierarchical mapping algorithm. For simplicity of illustration, we only use the problem of mapping a two-dimensional mesh (or a torus) into a cube as an example. However, the idea can be extended to solve higher dimensional problems.

In the task partition step, the tasks along one side of a mesh (or a torus) are partitioned into a supernode. In the global mapping step, the chain (or the ring) of supernodes is then stuffed into the host machine. The idea is just like rolling the mesh into a tube (or a torus), in which a supernode is formed by the tasks along a circumference, and then stuffing the tube into a box.

FIGS. 3A and 3B show two examples of mapping 64 tasks into a 4×4×4 cube. Each intersection of lines (corner) represents a processor. The first example, shown in FIG. 3A, is for an 8×8 torus. When it is rolled into a tube, each supernode is of size 8. If the dimension of a supernode is set to 4×2×1, the problem becomes putting an 8-node ring onto a 2×4 plane, which can be done straightforwardly. The conceptual mapped torus is shown in FIG. 3A. FIG. 3B shows the example of mapping a 4×16 torus into a 4×4×4 cube. If the mesh is rolled from the short side, it becomes a tube of circumference 4 that is 16 supernodes long. Since each supernode is of dimension 2×2×1, the global mapping problem becomes stuffing a 16-node ring into a 2×2×4 cube. Using the space filling curve for three dimensional space, one can obtain a mapping like the one shown in FIG. 3B.
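The rolling step itself is simple index bookkeeping. The sketch below groups each column of an R×C torus into one supernode, matching the FIG. 3A example; the tuple layout is an illustrative choice, not from the disclosure.

```python
def roll_into_supernodes(R, C):
    """Roll an R x C torus along its columns: the R tasks of column c
    form supernode c, yielding a ring of C supernodes (the 'tube')."""
    return [[(r, c) for r in range(R)] for c in range(C)]

tube = roll_into_supernodes(8, 8)      # FIG. 3A: 8 supernodes of size 8
print(len(tube), len(tube[0]))         # 8 8
```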

This mapping approach may encounter problems when the tube is turned around a corner. Similar problems have been studied in Hao Yu, I-Hsin Chung, and Jose Moreira. Topology mapping for Blue Gene/L supercomputer. In SC '06: Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, page 116, New York, N.Y., USA, 2006. ACM, in which the nodes around the corners are twisted to minimize the dilation distances. The same technique may be used here. However, the mapping is further evaluated and improved by optimization methods.

Local Tuning

The local tuning step of the hierarchical mapping fine tunes the mapping by local swapping. The framework of the local tuning is sketched as follows.

-   Given an initial mapping φ, compute the evaluation function C(φ).
-   For k=1, 2, . . . until C(φ) converges:

a) Propose a new φ′.

b) Evaluate C(φ′).

c) If C(φ′) < ρ_(k) C(φ), set φ = φ′.

In this framework, three things can be varied: the first is the definition of the evaluation function; the second is the method of proposing a new φ′; and the third is the choice of the parameter ρ_(k). Many optimization methods conform to this framework, such as the local search algorithm and the simulated annealing method. The idea is to find a better mapping from the existing one. In the present disclosure, in one embodiment, we use the simple local search algorithm, which fixes ρ_(k)=1 and proposes a new φ′ by swapping a task with its neighbors. The evaluation function C(φ) used is defined below.
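A sketch of that loop with the stated choice ρ_(k)=1, assuming a caller-supplied cost function and per-task neighbor lists; the random proposal order and the iteration cap are illustrative. Letting rho vary per step would cover acceptance schemes such as simulated annealing under the same skeleton.

```python
import random

def local_search(phi, cost, neighbors, max_iters=1000, rho=1.0, seed=0):
    """Propose phi' by swapping a task with one of its neighbors and
    accept whenever C(phi') < rho * C(phi); rho = 1 gives the simple
    local search of the framework above."""
    rng = random.Random(seed)
    current = cost(phi)
    for _ in range(max_iters):
        i = rng.randrange(len(phi))
        j = rng.choice(neighbors[i])
        phi[i], phi[j] = phi[j], phi[i]        # propose a neighbor swap
        proposed = cost(phi)
        if proposed < rho * current:
            current = proposed                  # accept phi'
        else:
            phi[i], phi[j] = phi[j], phi[i]    # reject: undo the swap
    return phi, current

# Toy usage: a chain of 8 tasks, adjacent-only neighbors, and a
# hypothetical cost that prefers task i on processor i.
nbrs = {i: [j for j in (i - 1, i + 1) if 0 <= j < 8] for i in range(8)}
cost = lambda m: sum(abs(m[i] - i) for i in range(8))
print(local_search(list(range(7, -1, -1)), cost, nbrs))
```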

When the message size of communication is taken into consideration, the dilation distance may not be the best metric for a mapping. Here we propose a new metric, called communication cost, to measure the quality of mappings. The communication cost is composed of two factors: the traffic pattern of the tasks and the processor distance.

The traffic pattern of the tasks is modeled by a traffic matrix, e.g., matrix T, whose element T(i,j) represents the message size sent from task i to task j. The content of matrix T can be obtained from analysis of the programs, in which function calls for communication, such as MPI_SEND or MPI_REDUCE, provide hints about the traffic pattern and message size. A more expensive, but more robust, way to obtain T is from measurement of a sample execution of the program.
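One way to accumulate such a matrix from a run-time trace, assuming each trace record is a hypothetical (source rank, destination rank, message bytes) tuple, e.g., one record per send observed by a tracing tool:

```python
import numpy as np

def traffic_matrix(trace, n_tasks):
    """Sum message sizes per (source, destination) pair into T."""
    T = np.zeros((n_tasks, n_tasks))
    for src, dst, nbytes in trace:
        T[src, dst] += nbytes
    return T

# Toy trace: rank 0 sends twice to rank 1, rank 1 replies once.
print(traffic_matrix([(0, 1, 1024), (0, 1, 512), (1, 0, 1024)], 2))
```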

The processor distance is also represented by a matrix, e.g., matrix D. Element D(i,j) is the cost, which may mean the time taken, of sending a unit message from processor i to processor j. A simple model for formulating the matrix D is the number of links on the shortest path between two processors, which is also called the hopping distance. For a more accurate measurement, matrix D can be evaluated via experiments.
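Under the hopping-distance model, the distance on a 3-D mesh or torus is a per-dimension sum of link counts; the coordinate-tuple representation and the torus flag below are illustrative assumptions.

```python
def hop_distance(p, q, dims, torus=True):
    """Links on a shortest path between processor coordinates p and q
    on a 3-D mesh (torus=False) or torus (torus=True)."""
    hops = 0
    for a, b, n in zip(p, q, dims):
        step = abs(a - b)
        hops += min(step, n - step) if torus else step
    return hops

print(hop_distance((0, 0, 0), (3, 3, 3), (4, 4, 4)))              # 3: wraparound
print(hop_distance((0, 0, 0), (3, 3, 3), (4, 4, 4), torus=False)) # 9
```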

With the traffic matrix T and the distance matrix D, the communication cost of a mapping is defined as

${{C(\varphi)} = {\sum\limits_{i,{j = 1}}^{n}\; {{T\left( {i,j} \right)}{D\left( {{\varphi (i)},{\varphi (j)}} \right)}}}},$

which is the summation of the communication time over all pairs of tasks mapped to the host machine.
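Evaluated directly, the sum takes a few lines; the sketch assumes T and D are NumPy arrays and φ is a list mapping task ranks to processor indices.

```python
import numpy as np

def communication_cost(phi, T, D):
    """C(phi) = sum over task pairs (i, j) of T(i, j) * D(phi(i), phi(j))."""
    idx = np.asarray(phi)
    return float((T * D[np.ix_(idx, idx)]).sum())

# Two tasks exchanging 5 units each way, placed 2 hops apart: cost 20.
T = np.array([[0, 5], [5, 0]])
D = np.array([[0, 2], [2, 0]])
print(communication_cost([0, 1], T, D))
```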

FIG. 4 illustrates a schematic of an example computer or processing system that may implement the task mapping system in one embodiment of the present disclosure. The computer system is only one example of a suitable processing system and is not intended to suggest any limitation as to the scope of use or functionality of embodiments of the methodology described herein. The processing system shown may be operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well-known computing systems, environments, and/or configurations that may be suitable for use with the processing system shown in FIG. 4 may include, but are not limited to, personal computer systems, server computer systems, thin clients, thick clients, handheld or laptop devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputer systems, mainframe computer systems, and distributed cloud computing environments that include any of the above systems or devices, and the like.

The computer system may be described in the general context of computer system executable instructions, such as program modules, being executed by a computer system. Generally, program modules may include routines, programs, objects, components, logic, data structures, and so on that perform particular tasks or implement particular abstract data types. The computer system may be practiced in distributed cloud computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed cloud computing environment, program modules may be located in both local and remote computer system storage media including memory storage devices.

The components of the computer system may include, but are not limited to, one or more processors or processing units 12, a system memory 16, and a bus 14 that couples various system components including system memory 16 to processor 12. The processor 12 may include a task mapping module 10 that performs the methods described herein. The module 10 may be programmed into the integrated circuits of the processor 12, or loaded from memory 16, storage device 18, or network 24, or combinations thereof.

Bus 14 may represent one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, Enhanced ISA (EISA) bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnects (PCI) bus.

Computer system may include a variety of computer system readable media. Such media may be any available media that is accessible by the computer system, and it may include both volatile and non-volatile media, removable and non-removable media.

System memory 16 can include computer system readable media in the form of volatile memory, such as random access memory (RAM) and/or cache memory or others. Computer system may further include other removable/non-removable, volatile/non-volatile computer system storage media. By way of example only, storage system 18 can be provided for reading from and writing to a non-removable, non-volatile magnetic media (e.g., a “hard drive”). Although not shown, a magnetic disk drive for reading from and writing to a removable, non-volatile magnetic disk (e.g., a “floppy disk”), and an optical disk drive for reading from or writing to a removable, non-volatile optical disk such as a CD-ROM, DVD-ROM or other optical media can be provided. In such instances, each can be connected to bus 14 by one or more data media interfaces.

Computer system may also communicate with one or more external devices 26 such as a keyboard, a pointing device, a display 28, etc.; one or more devices that enable a user to interact with the computer system; and/or any devices (e.g., network card, modem, etc.) that enable the computer system to communicate with one or more other computing devices. Such communication can occur via Input/Output (I/O) interfaces 20.

Still yet, computer system can communicate with one or more networks 24 such as a local area network (LAN), a general wide area network (WAN), and/or a public network (e.g., the Internet) via network adapter 22. As depicted, network adapter 22 communicates with the other components of the computer system via bus 14. It should be understood that although not shown, other hardware and/or software components could be used in conjunction with the computer system. Examples include, but are not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data archival storage systems, etc.

As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method or computer program product. Accordingly, aspects of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system.” Furthermore, aspects of the present invention may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied thereon.

Any combination of one or more computer readable medium(s) may be utilized. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.

A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.

Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages, a scripting language such as Perl, VBS or similar languages, and/or functional languages such as Lisp and ML and logic-oriented languages such as Prolog. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).

Aspects of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.

The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The computer program product may comprise all the respective features enabling the implementation of the methodology described herein, and which, when loaded in a computer system, is able to carry out the methods. Computer program, software program, program, or software, in the present context means any expression, in any language, code or notation, of a set of instructions intended to cause a system having an information processing capability to perform a particular function either directly or after either or both of the following: (a) conversion to another language, code or notation; and/or (b) reproduction in a different material form.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

The corresponding structures, materials, acts, and equivalents of all means or step plus function elements, if any, in the claims below are intended to include any structure, material, or act for performing the function in combination with other claimed elements as specifically claimed. The description of the present invention has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the invention. The embodiment was chosen and described in order to best explain the principles of the invention and the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.

Various aspects of the present disclosure may be embodied as a program, software, or computer instructions embodied in a computer or machine usable or readable medium, which causes the computer or machine to perform the steps of the method when executed on the computer, processor, and/or machine. A program storage device readable by a machine, tangibly embodying a program of instructions executable by the machine to perform various functionalities and methods described in the present disclosure, is also provided.

The system and method of the present disclosure may be implemented and run on a general-purpose computer or special-purpose computer system. The terms “computer system” and “computer network” as may be used in the present application may include a variety of combinations of fixed and/or portable computer hardware, software, peripherals, and storage devices. The computer system may include a plurality of individual components that are networked or otherwise linked to perform collaboratively, or may include one or more stand-alone components. The hardware and software components of the computer system of the present application may include and may be included within fixed and portable devices such as desktop, laptop, and/or server. A module may be a component of a device, software, program, or system that implements some “functionality”, which can be embodied as software, hardware, firmware, electronic circuitry, etc.

The embodiments described above are illustrative examples and it should not be construed that the present invention is limited to these particular embodiments. Thus, various changes and modifications may be effected by one skilled in the art without departing from the spirit or scope of the invention as defined in the appended claims.

1. A computer readable storage medium storing a program of instructions executable by a machine to perform a method of mapping tasks to physical processors in a parallel computing system, comprising: partitioning tasks in the parallel computing system into groups of tasks, the tasks being grouped according to their communication pattern and frequency; mapping, by a processor, the groups of tasks to groups of physical processors, respectively; and fine tuning, by the processor, the mapping of tasks to processors within each of the groups.
 2. The computer readable storage medium of claim 1, wherein the partitioning is performed by utilizing a run-time communication matrix collected during run-time of the tasks, based on message passing interface communications occurring among the tasks, wherein the partitioning preserves locality of communication among the tasks.
3. The computer readable storage medium of claim 2, wherein the tasks that communicated with one another a predetermined number of times are partitioned into the same group.
 4. The computer readable storage medium of claim 1, wherein the tasks in each of the groups are classified into boundary tasks and interior tasks, wherein the boundary tasks are selected to maintain continuity of communication among said groups of tasks, and wherein the interior tasks can be swapped in fine tuning the mapping of tasks to processors within each of the groups.
 5. The computer readable storage medium of claim 1, wherein the mapping can be performed based on Moore's space filling curve technique.
 6. The computer readable storage medium of claim 1, wherein the fine tuning includes swapping assignment of tasks to physical processors within said each of the groups.
 7. The computer readable storage medium of claim 6, wherein the swapping is performed based on a local search method, wherein the tasks are classified into boundary tasks and interior tasks, the boundary tasks selected to maintain continuity and the interior tasks are swapped based on a greedy algorithm.
 8. The computer readable storage medium of claim 7, wherein communication cost of mapping including summation of communication time over all pairs of tasks is used to determine swapping.
 9. The computer readable storage medium of claim 8, wherein said communication cost is determined based on runtime measurements of the tasks.
10. A system for mapping tasks to physical processors in a parallel computing system, comprising: a processor; a module operable to execute on the processor, and further operable to partition tasks in the parallel computing system into groups of tasks, the tasks being grouped according to their communication pattern and frequency, the module further operable to map the groups of tasks to groups of physical processors, respectively, the module further operable to fine tune the mapping of tasks to processors within each of the groups.
 11. The system of claim 10, wherein the module partitions the tasks by utilizing a run-time communication matrix collected during run-time of the tasks, based on message passing interface communications occurring among the tasks, wherein the partitioning preserves locality of communication among the tasks.
 12. The system of claim 10, wherein the tasks in each of the groups are classified into boundary tasks and interior tasks, wherein the boundary tasks are selected to maintain continuity of communication among said groups of tasks, and wherein the interior tasks can be swapped in fine tuning the mapping of tasks to processors within each of the groups.
 13. The system of claim 10, wherein the fine tuning includes swapping assignment of tasks to physical processors within said each of the groups.
 14. The system of claim 13, wherein communication cost of mapping including summation of communication time over all pairs of tasks is used to determine swapping.
 15. The system of claim 14, wherein said communication cost is determined based on runtime measurements of the tasks. 